Motivation

At useR! 2017 I saw a poster that blew my mind. It was Alex Kruse’s map of Hamburg that showed common routes bike share users would take. I’ve always wanted to reproduce it so here I am doing my best to make something similar with DC’s Capital Bike Share open data.

Most of the techniques I use here were learned by reading Geocomputation with R, exploring Stack Overflow, and by reading different vignettes from the R community.

Setup

library(tidyverse)
library(tmaptools)
library(tmap)
library(osmdata)
library(stplanr)
library(igraph)
library(sf)

DC Neighborhoods

Data can be found here.

dc <- st_read("Neighborhood_Clusters/Neighborhood_Clusters.shp")
## Reading layer `Neighborhood_Clusters' from data source `/Users/ben/Neighborhood_Clusters/Neighborhood_Clusters.shp' using driver `ESRI Shapefile'
## Simple feature collection with 46 features and 7 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: -77.1198 ymin: 38.79164 xmax: -76.90915 ymax: 38.99597
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
summary(dc)
##     OBJECTID                        WEB_URL           NAME   
##  Min.   : 1.00   http://planning.dc.gov/:39   Cluster 1 : 1  
##  1st Qu.:12.25   NA's                   : 7   Cluster 10: 1  
##  Median :23.50                                Cluster 11: 1  
##  Mean   :23.50                                Cluster 12: 1  
##  3rd Qu.:34.75                                Cluster 13: 1  
##  Max.   :46.00                                Cluster 14: 1  
##                                               (Other)   :40  
##                                            NBH_NAMES    Shape_Leng   
##  Arboretum, Anacostia River                     : 1   Min.   : 1990  
##  Brightwood Park, Crestwood, Petworth           : 1   1st Qu.: 6920  
##  Brookland, Brentwood, Langdon                  : 1   Median : 8883  
##  Capitol Hill, Lincoln Park                     : 1   Mean   :10457  
##  Capitol View, Marshall Heights, Benning Heights: 1   3rd Qu.:10654  
##  Cathedral Heights, McLean Gardens, Glover Park : 1   Max.   :57862  
##  (Other)                                        :40                  
##    Shape_Area               TYPE             geometry 
##  Min.   :  312062   Additional: 7   POLYGON      :46  
##  1st Qu.: 1943120   Original  :39   epsg:4326    : 0  
##  Median : 3433941                   +proj=long...: 0  
##  Mean   : 3857591                                     
##  3rd Qu.: 4497491                                     
##  Max.   :21175571                                     
## 
tm_shape(dc) +
  tm_borders('hotpink') +
  tm_layout(title = 'DC Neighborhood Clusters',
            title.color = 'white',
            bg.color = 'grey5')

#create a bbox of dc
dc_bbox <- st_bbox(dc)

#create a simple feature collection of major roads in dc
dc_roads <- c('primary', 'secondary', 'tertiary', 'cycleway') %>%
  map(function(x){
    opq(bbox = dc_bbox) %>%
      add_osm_feature(key = 'highway', value = x) %>%
      osmdata_sf() %>%
      .$osm_lines %>%
      select(geometry, highway)
  }) %>%
  do.call(rbind, .)
tm_shape(dc) +
  tm_borders('hotpink4', 1) +
  tm_fill('hotpink3', .05) +
  tm_shape(filter(dc_roads, highway == 'primary')) +
  tm_lines('grey100', 1) +
  tm_shape(filter(dc_roads, highway == 'secondary')) +
  tm_lines('grey75', .5) +
  tm_shape(filter(dc_roads, highway == 'tertiary')) +
  tm_lines('grey50', .5) +
  tm_shape(filter(dc_roads, highway == 'cycleway')) +
  tm_lines('grey25', .5) +
  tm_layout(title = 'DC Roads',
            title.color = 'white',
            bg.color = 'grey5')

Capital Bike Share

Station information can be found here. There is a lot of information here, but we only want the longitude, latitude, and an identifyer.

stations <- read_csv('Capital_Bike_Share_Locations.csv') %>%
  select(lon = LONGITUDE,
         lat = LATITUDE,
         name =ADDRESS) %>%
  st_as_sf(coords = c('lon', 'lat')) %>%
  st_set_crs(4326)

All the Origin-Destination routes for 2017 can be found at https://s3.amazonaws.com/capitalbikeshare-data/index.html. They’re divided into quadrants so we will need to bind the data togeather. Like the station data, we are given a lot of information we don’t really need so we will only keep the starting location, the ending location, and the month that the journey started on. We will also count the number of times each Origin-Destination journey took place.

trips_2017 <-str_glue('2017-capitalbikeshare-tripdata/2017Q{1:4}-capitalbikeshare-tripdata.csv') %>%
  map(function(x){
    read_csv(x)
  }) %>%
  bind_rows() %>%
  select(start = 'Start station',
         end = 'End station',
         month = 'Start date') %>%
  mutate(month = lubridate::month(month, label = T)) %>%
  count(start, end, month, sort = T) 

head(trips_2017)
## # A tibble: 6 x 4
##   start                                                 end    month     n
##   <chr>                                                 <chr>  <ord> <int>
## 1 Jefferson Dr & 14th St SW                             Jeffe… Apr    1212
## 2 Jefferson Dr & 14th St SW                             Jeffe… Jun    1096
## 3 Lincoln Memorial                                      Jeffe… Jul    1047
## 4 Jefferson Dr & 14th St SW                             Jeffe… Jul    1031
## 5 Smithsonian-National Mall / Jefferson Dr & 12th St SW Smith… Jul     982
## 6 Smithsonian-National Mall / Jefferson Dr & 12th St SW Smith… Apr     962

We need to double check that the two data sets are joinable.

toClean <- trips_2017$start %>%
  c(trips_2017$end) %>%
  unique %>%
  .[!. %in% stations$name] %>%
  sort

toClean
## [1] "14th & D St SE"                    "15th & Crystal Dr"                
## [3] "20th & Crystal Dr"                 "23rd & Crystal Dr"                
## [5] "27th & Crystal Dr"                 "S George Mason Dr & 13th St S"    
## [7] "Solutions & Greensboro Dr"         "USDA / 12th & Independence Ave SW"
potentialStations <- stations$name %>%
  .[str_detect(.,'Crystal|USDA|D St|George Mason|Greensboro')] %>%
  sort

potentialStations
##  [1] "13th & D St NE"                          
##  [2] "14th & D St NW / Ronald Reagan Building" 
##  [3] "1st & D St SE"                           
##  [4] "3rd & D St SE"                           
##  [5] "4th & D St NW / Judiciary Square"        
##  [6] "8th & D St NW"                           
##  [7] "Arlington Blvd & S George Mason Dr/NFATC"
##  [8] "Crystal City Metro / 18th & Bell St"     
##  [9] "Crystal Dr & 15th St S"                  
## [10] "Crystal Dr & 20th St S"                  
## [11] "Crystal Dr & 23rd St S"                  
## [12] "Crystal Dr & 27th St S"                  
## [13] "Crystal Dr & Potomac Ave"                
## [14] "D St & Maryland Ave NE"                  
## [15] "George Mason Dr & Wilson Blvd"           
## [16] "Greensboro & International Dr"           
## [17] "Greensboro & Pinnacle Dr"                
## [18] "Oklahoma Ave & D St NE"                  
## [19] "Pershing & N George Mason Dr"            
## [20] "S George Mason Dr & Four Mile Run"       
## [21] "USDA / 12th & C St SW"
trips_2017_clean <- trips_2017 %>%
  mutate_at(vars(start, end), function(x){
    map_chr(x, function(y){
      if(y == "15th & Crystal Dr") return("Crystal Dr & 15th St S")
      if(y == "20th & Crystal Dr") return("Crystal Dr & 20th St S")
      if(y == "23rd & Crystal Dr") return("Crystal Dr & 23rd St S")
      if(y == "27th & Crystal Dr") return("Crystal Dr & 27th St S")
      if(y == "USDA / 12th & Independence Ave SW") return("USDA / 12th & C St SW")
      if(y == "S George Mason Dr & 13th St S") return("S George Mason Dr & Four Mile Run")
      if(y == "Solutions & Greensboro Dr") return("Greensboro & Pinnacle Dr")
      return(y)
    })
  }) %>%
  filter(start %in% stations$name) %>%
  filter(end %in% stations$name) 

For our exploration, we will focus on roads traveled, not the direction of the routes. So we will consolidate all routes A → B with routes B → A.

trips_2017_clean <- trips_2017_clean %>%
  (function(x){
    a <- filter(x, start < end)
    b <- filter(x, start > end)
    names(b) <- c('end', 'start', 'month', 'n')
    rbind(a, b)
  }) %>%
  count(start, end, month, wt = n, sort = T) 

head(trips_2017_clean)
## # A tibble: 6 x 4
##   start                     end              month    nn
##   <chr>                     <chr>            <ord> <int>
## 1 Jefferson Memorial        Lincoln Memorial Jul    1626
## 2 Jefferson Memorial        Lincoln Memorial Aug    1514
## 3 Jefferson Dr & 14th St SW Lincoln Memorial Jun    1502
## 4 Jefferson Dr & 14th St SW Lincoln Memorial Jul    1462
## 5 Jefferson Dr & 14th St SW Lincoln Memorial Aug    1386
## 6 Jefferson Memorial        Lincoln Memorial Apr    1384
summary(trips_2017_clean$nn)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00    5.00   13.23   14.00 1626.00

For computational reasons, let’s only plot routes that have a total number of riders greater than the 3rd quartile.

trips_2017_freq <- trips_2017_clean %>%
  filter(nn > 14) %>%
  select(start, end) %>%
  distinct()

With this information we can plot desire lines. These lines are direct paths between origins and destinations.

dc_desire <- od2line(trips_2017_freq, stations) %>%
  .[as.logical(st_within(., st_as_sfc(dc_bbox))),] %>%
  filter(!st_is_empty(.))

tm_shape(dc_desire) +
  tm_lines('white', alpha = .05) +
  tm_shape(dc) +
  tm_borders('hotpink4', 1) +
  tm_layout(title = 'DC Bike Share as the Crow Flies',
            title.color = 'white',
            bg.color = 'grey5')

This is really cool, but considering that these routes are in a city, they aren’t realistic. There are a number of obstacles in cities (buildings for one…) so we will need to convert these desire lines into actual routes

dc_routes <- line2route(dc_desire, route_fun = route_osrm)

tm_shape(dc_routes) +
  tm_lines('white', alpha = .05) +
  tm_shape(dc) +
  tm_borders('hotpink4', 1) +
  tm_layout(title = 'DC Bike Share as the Crow Flies',
            title.color = 'white',
            bg.color = 'grey5')

Aggregating Lines

So the previous image contained path information. We can combine the path information with the original Bike Share data to see which routes most bikers were riding on.

dc_routes_wt <- dc_routes %>%
  mutate(start = dc_desire$start,
         end = dc_desire$end) %>%
  left_join(trips_2017_clean)
## Joining, by = c("start", "end")
summary(dc_routes_wt)
##     distance          duration         error                id           
##  Min.   :   77.7   Min.   :  16.5   Length:119526      Length:119526     
##  1st Qu.: 1659.0   1st Qu.: 326.4   Class :character   Class :character  
##  Median : 2568.7   Median : 470.2   Mode  :character   Mode  :character  
##  Mean   : 3020.5   Mean   : 487.4                                        
##  3rd Qu.: 3805.0   3rd Qu.: 629.2                                        
##  Max.   :30448.3   Max.   :1626.0                                        
##  NA's   :38        NA's   :38                                            
##     start               end                month             nn         
##  Length:119526      Length:119526      Oct    :10532   Min.   :   1.00  
##  Class :character   Class :character   Sep    :10518   1st Qu.:   8.00  
##  Mode  :character   Mode  :character   Aug    :10453   Median :  16.00  
##                                        Nov    :10333   Mean   :  25.84  
##                                        Jul    :10324   3rd Qu.:  29.00  
##                                        Jun    :10256   Max.   :1626.00  
##                                        (Other):57110                    
##             geometry     
##  MULTILINESTRING:119526  
##  epsg:4326      :     0  
##  +proj=long...  :     0  
##                          
##                          
##                          
## 

For plotting and data cleaning purposes we will only keep routes with more than 200 riders.

dc_routes_wt <- dc_routes_wt %>%
  filter(nn > 200)

There is a big problem however, many routes pass over other routes. If we plot them as they are, then there will be overlapping lines and the information stored on the covered lines will be hidden. What we can do to solve this problem is aggregate the information on the lines. This can be done by identifying overlapping lines and aggregatting the underlying values on those lines.

The below aggregation method was inspired by a stack overflow respondent named ibusett. The original answer can be found here.

dc_routes_intersect <- dc_routes_wt$month %>%
  levels() %>%
  map(function(x){
    filter(dc_routes_wt, 
           month == x) %>%
      st_intersection()
  }) %>%
  do.call(rbind, .)

dc_routes_month <- dc_routes_wt$month %>%
  levels() %>%
  {
    month_list <- map(., function(x){
      filter(dc_routes_wt, month == x)
    }) 
    names(month_list) <- .
    month_list
  }

dc_routes_agg <- dc_routes_intersect %>%
  pmap_int(function(origins, month, ...){
    sum(dc_routes_month[[month]][origins,]$nn)
  })

dc_routes_intersect$agg <- dc_routes_agg

Since we will only plot the lines, we will only need to keep the LINESTRING geometries from the returned geometry collection.

dc_routes_linestrings <-dc_routes_intersect %>%
  st_collection_extract('LINESTRING')

Because we have arranged our data by month, we can facet the images by month. We can either create a single image by faceting or we can create an animation. Below is the code for the animation.

trips_2017_anim <- tm_shape(filter(dc_roads, highway == 'primary')) +
  tm_lines('grey40', .1) +
  tm_shape(filter(dc_roads, highway == 'secondary')) +
  tm_lines('grey30', .5) +
  tm_shape(filter(dc_roads, highway == 'tertiary')) +
  tm_lines('grey20', .5) +
  tm_shape(dc_routes_linestrings) + 
  tm_lines(col = 'agg',
           1,
           title.col = '# Bikes',
           breaks = c(0, 1000, 2000, Inf),
           palette = c('#000dff', '#8086ff', "#ffffff"),
           labels = c('< 1k', '< 2k', '> 2k'))  +
  tm_facets(along = 'month', free.coords = F, drop.units = T)+
  tm_layout(bg.color = 'grey10',
            legend.text.color = 'white',
            title = "2017 DC Capital Bike Share Routes",
            title.color = 'white') 
The monthly mapped routes of DC bike share. High density riding seems to happend during the summer months (June, July, Aug).

The monthly mapped routes of DC bike share. High density riding seems to happend during the summer months (June, July, Aug).

Geo Graphs

Something I’m really excited about is the concept of the geo graph. With the spatial data we already have, we can create a graph where the nodes are the intersections and the edges are the routes.

sln <- dc_routes_linestrings %>%
  as('Spatial') %>%
  SpatialLinesNetwork()

E(sln@g)$agg <- dc_routes_linestrings$agg
E(sln@g)$month <- as.character(dc_routes_linestrings$month)
V(sln@g)$lon <- sln@g$x
V(sln@g)$lat <- sln@g$y  
V(sln@g)$strength <- strength(sln@g, weights = E(sln@g)$agg)
V(sln@g)$btwn <- betweenness(sln@g)

sln@g %>%
  {. - E(.)[month != "Jun"]} %>%
  {. - V(.)[degree(.) == 0]} %>% 
  plot(.,
       layout = matrix(c(V(.)$lon, V(.)$lat), nrow = vcount(.)),
       vertex.size = map_dbl(V(.)$strength / max(V(.)$strength), function(x){
         if(x < .33) return(2)
         if(x < .66) return(6)
         return(12)
       }),
       vertex.label = '',
       vertex.color = map_chr(V(.)$strength / max(V(.)$strength), function(x){
         if(x < .33) return('navy')
         if(x < .66) return('cyan')
         return('gold')
       }),
       main = "DC Bike Share: Intersection Betweeness June 2017") 

V(sln@g) %>%
  .[[strength == max(.$strength)]]
## + 1/603 vertex, from b2234fc:
##          lon    lat strength btwn
## 85 -77.05009 38.887   103211 1765

The intersection of Independence Avenue & Ohio Drive SW by the Lincoln and John Ericsson memorials may be worth looking into since it is one of the most crossed intersections.

Mixing graph analysis with spatial analysis will be a lot of fun and I plan on pursuing this type of joint analysis in the future.